A Semi-automatic Adaptive OCR for Digital Libraries

Authors

  • Sachin Rawat
  • K. S. Sesh Kumar
  • Million Meshesha
  • Indraneel Deb Sikdar
  • A. Balasubramanian
  • C. V. Jawahar
Abstract

This paper presents a novel approach to designing a semi-automatic adaptive OCR for large document image collections in digital libraries. We describe and implement an interactive system that continuously improves the OCR results. The applicability of our design to the recognition of Indian languages is demonstrated. Recognition errors are used to retrain the OCR so that it adapts and improves its accuracy. Limited human intervention is allowed for evaluating the output of the system and taking corrective actions during the recognition process.
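The feedback loop described in the abstract (recognize, let a human correct the errors, retrain on those errors) can be illustrated with a minimal sketch. This is not the authors' system: the nearest-mean classifier, the names `AdaptiveRecognizer` and `process_page`, the two-dimensional feature vectors, and the `reviewer` callback standing in for the human operator are all assumptions chosen for illustration.

```python
from collections import defaultdict


class AdaptiveRecognizer:
    """Toy nearest-mean classifier (an assumption, not the paper's classifier)."""

    def __init__(self):
        self.sums = {}                   # label -> per-dimension sum of features
        self.counts = defaultdict(int)   # label -> number of training samples

    def learn(self, features, label):
        """Add one labelled sample, shifting the mean of its class."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
        self.sums[label] = [s + f for s, f in zip(self.sums[label], features)]
        self.counts[label] += 1

    def recognize(self, features):
        """Return the label whose class mean is closest to the features."""
        def squared_distance(label):
            mean = [s / self.counts[label] for s in self.sums[label]]
            return sum((m - f) ** 2 for m, f in zip(mean, features))
        return min(self.sums, key=squared_distance)


def process_page(recognizer, samples, reviewer):
    """Recognize each sample and retrain on every error the reviewer corrects.

    `reviewer(features, predicted)` is a hypothetical stand-in for the human
    operator; it returns the correct label. Returns the number of errors.
    """
    errors = 0
    for features in samples:
        predicted = recognizer.recognize(features)
        truth = reviewer(features, predicted)
        if truth != predicted:
            errors += 1
            recognizer.learn(features, truth)  # feed the error back as training data
    return errors


if __name__ == "__main__":
    ocr = AdaptiveRecognizer()
    ocr.learn((0.0, 0.0), "a")
    ocr.learn((10.0, 10.0), "b")
    # A new "font" in which the character "a" has shifted features.
    page = [(6.0, 6.0), (6.0, 6.0)]
    always_a = lambda features, predicted: "a"
    print("errors on first pass:", process_page(ocr, page, always_a))
    print("errors on second pass:", process_page(ocr, page, always_a))
```

In this toy setting the error count drops on the second pass because the corrected samples have moved the class mean toward the new appearance of the character, which is the adaptive behaviour the abstract describes in general terms.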

Similar resources

Adaptive detection of missed text areas in OCR outputs: application to the automatic assessment of OCR quality in mass digitization projects

The French National Library (BnF) has launched many mass digitization projects in order to give access to its collection. Digital documents on Gallica (the digital library of the BnF) are indexed through their textual content, which is obtained from service providers that use Optical Character Recognition (OCR) software. OCR software packages have become increasingly complex systems composed of ...

Performance Characterization and Parallelization of Tesseract Optical Character Recognition on Multicore Architectures

Optical Character Recognition, or OCR, is one of the major topics in computer vision technology. It is widely used in various applications, such as digital libraries, automatic banking systems, and mailing services. The Tesseract OCR Engine, which we evaluate in this paper, is a renowned OCR program. It was originally developed by Hewlett Packard Lab between 1985 and 1995, and has been main...

On Automatic Similarity Linking in Digital Libraries

Hypertext links are a powerful extension of standard information retrieval techniques based on query languages. However, the generation of links is often impractical due to large manual and/or computational effort. In this paper, we analyze the effects of two main approaches that aim at a restriction of the necessary efforts: The direct use of OCR-processed documents instead of manually post-pr...

Video OCR for Video Indexing

OCR is a technique that can greatly help to locate topics of interest in video via the automatic extraction and reading of captions and annotations. Text in video can provide key indexing information. Recognizing such text for search applications is critical. The major difficulties for character recognition in video are degraded and deformed characters, low-resolution characters or very ...

2 Toshio

The automatic extraction and recognition of news captions and annotations can be of great help in locating topics of interest in digital news video libraries. To achieve this goal, we present a technique, called Video OCR (Optical Character Reader), which detects, extracts, and reads text areas in digital video data. In this paper, we address problems, describe the method by which Video OCR operat...

Publication date: 2006